A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Canceling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but it is less desirable for hotels and can reduce their revenue. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing a high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# Installing the libraries with the specified version.
# !pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 statsmodels==0.14.1 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/INNHotelsGroup.csv')
dataset.head(10)
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
| 5 | INN00006 | 2 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 346 | 2018 | 9 | 13 | Online | 0 | 0 | 0 | 115.00000 | 1 | Canceled |
| 6 | INN00007 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 34 | 2017 | 10 | 15 | Online | 0 | 0 | 0 | 107.55000 | 1 | Not_Canceled |
| 7 | INN00008 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 4 | 83 | 2018 | 12 | 26 | Online | 0 | 0 | 0 | 105.61000 | 1 | Not_Canceled |
| 8 | INN00009 | 3 | 0 | 0 | 4 | Meal Plan 1 | 0 | Room_Type 1 | 121 | 2018 | 7 | 6 | Offline | 0 | 0 | 0 | 96.90000 | 1 | Not_Canceled |
| 9 | INN00010 | 2 | 0 | 0 | 5 | Meal Plan 1 | 0 | Room_Type 4 | 44 | 2018 | 10 | 18 | Online | 0 | 0 | 0 | 133.44000 | 3 | Not_Canceled |
# checking number of rows and columns
dataset.shape
(36275, 19)
# checking information, data types
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Booking_ID | 36275 non-null | object |
| 1 | no_of_adults | 36275 non-null | int64 |
| 2 | no_of_children | 36275 non-null | int64 |
| 3 | no_of_weekend_nights | 36275 non-null | int64 |
| 4 | no_of_week_nights | 36275 non-null | int64 |
| 5 | type_of_meal_plan | 36275 non-null | object |
| 6 | required_car_parking_space | 36275 non-null | int64 |
| 7 | room_type_reserved | 36275 non-null | object |
| 8 | lead_time | 36275 non-null | int64 |
| 9 | arrival_year | 36275 non-null | int64 |
| 10 | arrival_month | 36275 non-null | int64 |
| 11 | arrival_date | 36275 non-null | int64 |
| 12 | market_segment_type | 36275 non-null | object |
| 13 | repeated_guest | 36275 non-null | int64 |
| 14 | no_of_previous_cancellations | 36275 non-null | int64 |
| 15 | no_of_previous_bookings_not_canceled | 36275 non-null | int64 |
| 16 | avg_price_per_room | 36275 non-null | float64 |
| 17 | no_of_special_requests | 36275 non-null | int64 |
| 18 | booking_status | 36275 non-null | object |
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
# checking null values
dataset.isnull().sum()
| | 0 |
|---|---|
| Booking_ID | 0 |
| no_of_adults | 0 |
| no_of_children | 0 |
| no_of_weekend_nights | 0 |
| no_of_week_nights | 0 |
| type_of_meal_plan | 0 |
| required_car_parking_space | 0 |
| room_type_reserved | 0 |
| lead_time | 0 |
| arrival_year | 0 |
| arrival_month | 0 |
| arrival_date | 0 |
| market_segment_type | 0 |
| repeated_guest | 0 |
| no_of_previous_cancellations | 0 |
| no_of_previous_bookings_not_canceled | 0 |
| avg_price_per_room | 0 |
| no_of_special_requests | 0 |
| booking_status | 0 |
# checking duplicates
dataset.duplicated().sum()
0
# copying data to another variable to avoid any changes to original data
data = dataset.copy()
# Booking_ID is unique for each booking and irrelevant for analysis
data = data.drop('Booking_ID', axis=1)
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.00000 | 1.84496 | 0.51871 | 0.00000 | 2.00000 | 2.00000 | 2.00000 | 4.00000 |
| no_of_children | 36275.00000 | 0.10528 | 0.40265 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 10.00000 |
| no_of_weekend_nights | 36275.00000 | 0.81072 | 0.87064 | 0.00000 | 0.00000 | 1.00000 | 2.00000 | 7.00000 |
| no_of_week_nights | 36275.00000 | 2.20430 | 1.41090 | 0.00000 | 1.00000 | 2.00000 | 3.00000 | 17.00000 |
| required_car_parking_space | 36275.00000 | 0.03099 | 0.17328 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| lead_time | 36275.00000 | 85.23256 | 85.93082 | 0.00000 | 17.00000 | 57.00000 | 126.00000 | 443.00000 |
| arrival_year | 36275.00000 | 2017.82043 | 0.38384 | 2017.00000 | 2018.00000 | 2018.00000 | 2018.00000 | 2018.00000 |
| arrival_month | 36275.00000 | 7.42365 | 3.06989 | 1.00000 | 5.00000 | 8.00000 | 10.00000 | 12.00000 |
| arrival_date | 36275.00000 | 15.59700 | 8.74045 | 1.00000 | 8.00000 | 16.00000 | 23.00000 | 31.00000 |
| repeated_guest | 36275.00000 | 0.02564 | 0.15805 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| no_of_previous_cancellations | 36275.00000 | 0.02335 | 0.36833 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 13.00000 |
| no_of_previous_bookings_not_canceled | 36275.00000 | 0.15341 | 1.75417 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.00000 |
| avg_price_per_room | 36275.00000 | 103.42354 | 35.08942 | 0.00000 | 80.30000 | 99.45000 | 120.00000 | 540.00000 |
| no_of_special_requests | 36275.00000 | 0.61966 | 0.78624 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 5.00000 |
data.head()
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00000 | 0 | Not_Canceled |
| 1 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68000 | 1 | Not_Canceled |
| 2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00000 | 0 | Canceled |
| 3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00000 | 0 | Canceled |
| 4 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50000 | 0 | Canceled |
# function to create boxplot and histogram combined
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15, 10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color='blue'
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="red", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
histogram_boxplot(data, 'avg_price_per_room')
data[data["avg_price_per_room"] == 0]
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 2 | 2017 | 9 | 10 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| 145 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 13 | 2018 | 6 | 1 | Complementary | 1 | 3 | 5 | 0.00000 | 1 | Not_Canceled |
| 209 | 1 | 0 | 0 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2018 | 2 | 27 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| 266 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2017 | 8 | 12 | Complementary | 1 | 0 | 1 | 0.00000 | 1 | Not_Canceled |
| 267 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2017 | 8 | 23 | Complementary | 0 | 0 | 0 | 0.00000 | 1 | Not_Canceled |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35983 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 6 | 7 | Complementary | 1 | 4 | 17 | 0.00000 | 1 | Not_Canceled |
| 36080 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 3 | 21 | Complementary | 1 | 3 | 15 | 0.00000 | 1 | Not_Canceled |
| 36114 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 3 | 2 | Online | 0 | 0 | 0 | 0.00000 | 0 | Not_Canceled |
| 36217 | 2 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 2 | 3 | 2017 | 8 | 9 | Online | 0 | 0 | 0 | 0.00000 | 2 | Not_Canceled |
| 36250 | 1 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 6 | 2017 | 12 | 10 | Online | 0 | 0 | 0 | 0.00000 | 0 | Not_Canceled |
545 rows × 18 columns
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
| market_segment_type | count |
|---|---|
| Complementary | 354 |
| Online | 191 |
# Calculating the 25th and 75th quantile
Q1 = data["avg_price_per_room"].quantile(0.25)
Q3 = data['avg_price_per_room'].quantile(0.75)
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
179.55
# capping the extreme prices (>= 500) at the value of the upper whisker
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
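The upper whisker follows from the quartiles reported by `describe()`: Q3 + 1.5 × (Q3 − Q1) = 120 + 1.5 × 39.7 = 179.55. Note that the cell above replaces only prices of 500 or more, not every value above the whisker. The sketch below contrasts the two choices on a small made-up price series; the helper name `upper_whisker` and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def upper_whisker(series: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)


# Arithmetic check using the quartiles shown by describe() above
q1, q3 = 80.3, 120.0
whisker = q3 + 1.5 * (q3 - q1)

# Toy prices (illustrative only, not the hotel data)
prices = pd.Series(
    [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 130.0, 250.0, 540.0],
    name="avg_price_per_room",
)
cap = upper_whisker(prices)

# Option 1: mirror the notebook cell and replace only the extreme values (>= 500)
capped_extreme = prices.mask(prices >= 500, cap)

# Option 2: cap every value above the upper whisker
capped_all = prices.clip(upper=cap)

print(f"whisker from describe(): {whisker:.2f}")
print(f"toy cap: {cap:.2f}")
print(capped_extreme.tolist())
print(capped_all.tolist())
```

Replacing only the most extreme values keeps high but plausible room prices untouched, while clipping at the whisker flattens the whole upper tail.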
Observations on lead time
histogram_boxplot(data, 'lead_time')
histogram_boxplot(data, 'no_of_previous_cancellations')
Observation: most guests have no previous cancellations (the 75th percentile is 0); a few outliers have up to 13.
Observations on number of previous bookings not canceled
histogram_boxplot(data, 'no_of_previous_bookings_not_canceled')
Observation: very few customers have more than one previous booking that was not canceled; outliers go up to 58.
Observations on number of special requests
histogram_boxplot(data, 'no_of_special_requests')
Observation: most customers make no special requests, and the count falls as the number of special requests increases.
Observations on number of week nights
histogram_boxplot(data, 'no_of_week_nights')
Observation: most values cluster around 2-3 week nights, but some bookings have up to 17 week nights.
Observation on number of weekend nights
histogram_boxplot(data, 'no_of_weekend_nights')
Observation: most bookings include 0 or 1 weekend nights, but some include up to 7.
Observation on required car parking space
histogram_boxplot(data, 'required_car_parking_space')
Observation: most customers do not need a car parking space.
#Observation on repeated guests
histogram_boxplot(data, 'repeated_guest')
Observation: most customers are not repeated guests.
Observation on number of adults
histogram_boxplot(data, 'no_of_adults')
Observation: most bookings involve 2 adults, which suggests that customers are couples most of the time.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with the count or percentage
    plt.show()  # show the plot
labeled_barplot(data, 'no_of_adults', perc=True)
Observation on number of adults: 72% of the bookings were made for 2 adults and 21% for 1 adult, suggesting that the hotel may be attractive to couples and single travelers.
labeled_barplot(data, 'no_of_children', perc=True)
data['no_of_children'] = data['no_of_children'].replace([9, 10], 3)
labeled_barplot(data, 'no_of_weekend_nights', perc=True)
Observations:
labeled_barplot(data, 'no_of_week_nights', perc=True)
Observations:
# customers require car parking space
labeled_barplot(data, 'required_car_parking_space', perc=True)
Observation: 96.9% of customers do not need a car parking space.
# meal plan preference
labeled_barplot(data, 'type_of_meal_plan', perc=True)
Observation: Most of the customers select "Meal Plan 1" which should be breakfast.
# Room type preference
labeled_barplot(data, 'room_type_reserved', perc=True)
Observations:
# Arrival year distribution
labeled_barplot(data, 'arrival_year', perc=True)
Observation: most bookings are for arrivals in 2018; there may have been more promotions that year.
# Arrival month distribution
labeled_barplot(data, 'arrival_month', perc=True)
Observations:
# market segment type
labeled_barplot(data, 'market_segment_type', perc=True)
Observations:
Dominance of Online Bookings: The majority of customers prefer booking online, which is common in today’s digital landscape. This data suggests that maintaining and optimizing online booking systems should be a top priority.
Significant Offline Presence: Although online dominates, offline channels still capture a large portion of bookings, indicating that not all customers are ready or willing to use online platforms. Therefore, offering robust offline support (such as phone reservations, in-person bookings, or agent services) could still be valuable.
# repeated guests distribution
labeled_barplot(data, 'repeated_guest', perc=True)
Observation: 97.4% of guests are not repeated.
# Special requests
labeled_barplot(data, 'no_of_special_requests', perc=True)
Observation:
# booking status
labeled_barplot(data, 'booking_status', perc=True)
Observation:
# select numerical columns
cols_list = data.select_dtypes(include=np.number).columns.tolist()
# heatmap
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="coolwarm"
)
plt.show()
Observations:
Insights:
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )
    plt.tight_layout()
    plt.show()
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
stacked_barplot(dataset, "market_segment_type", "booking_status")
| market_segment_type | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11885 | 24390 | 36275 |
| Online | 8475 | 14739 | 23214 |
| Offline | 3153 | 7375 | 10528 |
| Corporate | 220 | 1797 | 2017 |
| Aviation | 37 | 88 | 125 |
| Complementary | 0 | 391 | 391 |
------------------------------------------------------------------------------------------------------------------------
Observation: the cancellation rate is highest for online bookings (about 37%), followed by offline and aviation (both about 30%) and corporate (about 11%). There are no cancellations among complementary bookings.
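The ranking of segments depends on whether raw counts or rates are compared. As a rough check, the cancellation rate per segment can be recomputed from the counts printed above; the dictionary below simply copies those counts by hand, and the variable names are illustrative.

```python
# (canceled, total) per market segment, copied from the crosstab output above
counts = {
    "Online": (8475, 23214),
    "Offline": (3153, 10528),
    "Corporate": (220, 2017),
    "Aviation": (37, 125),
    "Complementary": (0, 391),
}

# Cancellation rate = canceled bookings / all bookings in the segment
rates = {segment: canceled / total for segment, (canceled, total) in counts.items()}

# Segments ordered from highest to lowest cancellation rate
ranked = sorted(rates, key=rates.get, reverse=True)

for segment in ranked:
    print(f"{segment:<14}{rates[segment]:.1%}")
```

This is the same quantity that `pd.crosstab(..., normalize="index")` produces inside `stacked_barplot`, which is what the stacked bars display.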
stacked_barplot(dataset, "room_type_reserved", "booking_status")
| room_type_reserved | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11885 | 24390 | 36275 |
| Room_Type 1 | 9072 | 19058 | 28130 |
| Room_Type 4 | 2069 | 3988 | 6057 |
| Room_Type 6 | 406 | 560 | 966 |
| Room_Type 2 | 228 | 464 | 692 |
| Room_Type 5 | 72 | 193 | 265 |
| Room_Type 7 | 36 | 122 | 158 |
| Room_Type 3 | 2 | 5 | 7 |
------------------------------------------------------------------------------------------------------------------------
Observation: Room_Type 6 has the highest cancellation rate (about 42%), followed by Room_Type 4 (about 34%) and Room_Type 2 (about 33%).
stacked_barplot(dataset, "type_of_meal_plan", "booking_status")
| type_of_meal_plan | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11885 | 24390 | 36275 |
| Meal Plan 1 | 8679 | 19156 | 27835 |
| Not Selected | 1699 | 3431 | 5130 |
| Meal Plan 2 | 1506 | 1799 | 3305 |
| Meal Plan 3 | 1 | 4 | 5 |
------------------------------------------------------------------------------------------------------------------------
Observation: Meal Plan 2 has the highest cancellation rate (about 46%). I think the hotels may have to improve lunch quality or offer free lunches.
stacked_barplot(dataset, "arrival_month", "booking_status")
| arrival_month | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11885 | 24390 | 36275 |
| 10 | 1880 | 3437 | 5317 |
| 9 | 1538 | 3073 | 4611 |
| 8 | 1488 | 2325 | 3813 |
| 7 | 1314 | 1606 | 2920 |
| 6 | 1291 | 1912 | 3203 |
| 4 | 995 | 1741 | 2736 |
| 5 | 948 | 1650 | 2598 |
| 11 | 875 | 2105 | 2980 |
| 3 | 700 | 1658 | 2358 |
| 2 | 430 | 1274 | 1704 |
| 12 | 402 | 2619 | 3021 |
| 1 | 24 | 990 | 1014 |
------------------------------------------------------------------------------------------------------------------------
Observation: July has the highest cancellation rate (about 45%), followed by June (about 40%) and August (about 39%).
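The month with the most cancellations is not the month with the highest cancellation rate. A small sketch, using the monthly counts from the output above typed in by hand (the DataFrame name `monthly` is an assumption for illustration), makes the difference explicit.

```python
import pandas as pd

# Canceled and total bookings per arrival month, copied from the crosstab above
monthly = pd.DataFrame(
    {
        "canceled": [24, 430, 700, 995, 948, 1291, 1314, 1488, 1538, 1880, 875, 402],
        "total": [1014, 1704, 2358, 2736, 2598, 3203, 2920, 3813, 4611, 5317, 2980, 3021],
    },
    index=pd.Index(range(1, 13), name="arrival_month"),
)

# Share of each month's bookings that were canceled
monthly["cancel_rate"] = monthly["canceled"] / monthly["total"]

most_by_count = monthly["canceled"].idxmax()    # month with the most canceled bookings
most_by_rate = monthly["cancel_rate"].idxmax()  # month with the highest share canceled

print(monthly.sort_values("cancel_rate", ascending=False).head(3))
print("most cancellations by count:", most_by_count)
print("highest cancellation rate:", most_by_rate)
```

October leads by count because it is the busiest month, while July leads by rate; the stacked bar chart shows the rate.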
stacked_barplot(dataset, "repeated_guest", "booking_status")
| repeated_guest | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11885 | 24390 | 36275 |
| 0 | 11869 | 23476 | 35345 |
| 1 | 16 | 914 | 930 |
------------------------------------------------------------------------------------------------------------------------
Observation: repeated guests are much less likely to cancel bookings; they may be routine visitors.
stacked_barplot(dataset, "required_car_parking_space", "booking_status")
| required_car_parking_space | Canceled | Not_Canceled | All |
|---|---|---|---|
| All | 11885 | 24390 | 36275 |
| 0 | 11771 | 23380 | 35151 |
| 1 | 114 | 1010 | 1124 |
------------------------------------------------------------------------------------------------------------------------
Observation: People who require car parking are less likely to cancel booking.
distribution_plot_wrt_target(data, 'avg_price_per_room', 'booking_status')
Observation:
distribution_plot_wrt_target(data, 'no_of_previous_cancellations', 'booking_status')
distribution_plot_wrt_target(data, 'no_of_previous_bookings_not_canceled', 'booking_status')
distribution_plot_wrt_target(data, 'no_of_week_nights', 'booking_status')
Observation: Number of weeknights is relatively similar for both canceled and not canceled bookings, with a slight tendency for longer stays to have a higher likelihood of cancellation (as seen in the outliers). If you're aiming to reduce cancellations, especially for longer stays, targeted strategies (e.g., stricter cancellation policies or incentives for longer bookings) could help.
distribution_plot_wrt_target(dataset, 'no_of_weekend_nights', 'booking_status')
Observation: bookings with more weekend nights are slightly more prone to being canceled, especially those with 3 or more weekend nights. If reducing cancellations for longer weekend stays is a priority, you might consider offering incentives or flexible cancellation policies to retain those bookings.
distribution_plot_wrt_target(dataset, 'required_car_parking_space', 'booking_status')
distribution_plot_wrt_target(dataset, 'repeated_guest', 'booking_status')
Leading Questions:
# What are the busiest months in the hotel?
dataset['arrival_month'].value_counts()
| arrival_month | count |
|---|---|
| 10 | 5317 |
| 9 | 4611 |
| 8 | 3813 |
| 6 | 3203 |
| 12 | 3021 |
| 11 | 2980 |
| 7 | 2920 |
| 4 | 2736 |
| 5 | 2598 |
| 3 | 2358 |
| 2 | 1704 |
| 1 | 1014 |
plt.figure(figsize=(10, 6))
sns.countplot(data=dataset, x='arrival_month')
plt.title('Busiest Months in the Hotel')
plt.xlabel('Month')
plt.ylabel('Count')
plt.show()
Observation:
# Which market segment do most of the guests come from
dataset['market_segment_type'].value_counts()
| market_segment_type | count |
|---|---|
| Online | 23214 |
| Offline | 10528 |
| Corporate | 2017 |
| Complementary | 391 |
| Aviation | 125 |
plt.figure(figsize=(10, 6))
sns.countplot(data=dataset, x='market_segment_type')
plt.title('Market Segment Type')
plt.xlabel('Market Segment Type')
plt.ylabel('Count')
plt.show()
Observation: most guests come from the online segment, followed by offline. Aviation has the fewest guests, which makes sense given the small number of airline staff.
# Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
difference_in_room_prices = dataset.groupby('market_segment_type')['avg_price_per_room'].mean()
print(difference_in_room_prices)
| market_segment_type | avg_price_per_room |
|---|---|
| Aviation | 100.70400 |
| Complementary | 3.14176 |
| Corporate | 82.91174 |
| Offline | 91.63268 |
| Online | 112.25685 |
Name: avg_price_per_room, dtype: float64
Observations:
# What percentage of bookings are canceled?
percent_bookings_cancelled = dataset['booking_status'].value_counts(normalize=True) * 100
print(percent_bookings_cancelled)
| booking_status | proportion |
|---|---|
| Not_Canceled | 67.23639 |
| Canceled | 32.76361 |
Name: proportion, dtype: float64
Observation: 32.76% of bookings are canceled, which is quite high and needs further investigation.
# Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
repeating_guests = dataset[dataset['repeated_guest'] == 1]
percent_repeating_guests_cancelled = repeating_guests['booking_status'].value_counts(normalize=True) * 100
print(percent_repeating_guests_cancelled)
| booking_status | proportion |
|---|---|
| Not_Canceled | 98.27957 |
| Canceled | 1.72043 |
Name: proportion, dtype: float64
Observation: the cancellation rate among repeating guests is very low (1.72%); they are reliable customers.
# Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
special_requirements = dataset[dataset['no_of_special_requests'] > 0]
percent_special_requirements_cancelled = special_requirements['booking_status'].value_counts(normalize=True) * 100
print(percent_special_requirements_cancelled)
| booking_status | proportion |
|---|---|
| Not_Canceled | 79.75512 |
| Canceled | 20.24488 |
Name: proportion, dtype: float64
Observation: guests with special requests are less prone to cancellation (20.24%) than guests overall (32.76%).
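Filtering on `no_of_special_requests > 0` gives one side of the comparison only. One possible way to see how the cancellation rate changes with each additional request is a group-wise mean of a 0/1 cancellation flag. The sketch below runs on a tiny made-up frame (`toy` and `is_canceled` are illustrative names, and the values are not from the hotel data); the same two lines should apply to `dataset`.

```python
import pandas as pd

# Toy bookings (illustrative only)
toy = pd.DataFrame(
    {
        "no_of_special_requests": [0, 0, 0, 0, 1, 1, 1, 2, 2, 3],
        "booking_status": [
            "Canceled", "Canceled", "Not_Canceled", "Not_Canceled",
            "Canceled", "Not_Canceled", "Not_Canceled",
            "Not_Canceled", "Not_Canceled", "Not_Canceled",
        ],
    }
)

# 1 if the booking was canceled, 0 otherwise
toy["is_canceled"] = (toy["booking_status"] == "Canceled").astype(int)

# The mean of a 0/1 flag within each group is that group's cancellation rate
rate_by_requests = toy.groupby("no_of_special_requests")["is_canceled"].mean()

print(rate_by_requests)
```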
dataset.isnull().sum()
| | 0 |
|---|---|
| Booking_ID | 0 |
| no_of_adults | 0 |
| no_of_children | 0 |
| no_of_weekend_nights | 0 |
| no_of_week_nights | 0 |
| type_of_meal_plan | 0 |
| required_car_parking_space | 0 |
| room_type_reserved | 0 |
| lead_time | 0 |
| arrival_year | 0 |
| arrival_month | 0 |
| arrival_date | 0 |
| market_segment_type | 0 |
| repeated_guest | 0 |
| no_of_previous_cancellations | 0 |
| no_of_previous_bookings_not_canceled | 0 |
| avg_price_per_room | 0 |
| no_of_special_requests | 0 |
| booking_status | 0 |
Observation: There are no missing values
numeric_columns = dataset.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(dataset[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations:
There are quite a few outliers in the data. However, we will not treat them as they are proper values
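To put numbers on what the boxplots show, the share of points outside the 1.5 × IQR whiskers can be computed per column. This is a sketch on made-up data; `iqr_outlier_share` is a hypothetical helper, not something defined in the notebook.

```python
import numpy as np
import pandas as pd


def iqr_outlier_share(df: pd.DataFrame, whis: float = 1.5) -> pd.Series:
    """Percentage of values per numeric column lying outside the IQR whiskers."""
    num = df.select_dtypes(include=np.number)
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against a Series align on column names
    outside = (num < q1 - whis * iqr) | (num > q3 + whis * iqr)
    return outside.mean() * 100


# Toy data (illustrative only)
toy = pd.DataFrame(
    {
        "lead_time": [0, 10, 20, 30, 40, 50, 60, 70, 80, 400],
        "repeated_guest": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    }
)

share = iqr_outlier_share(toy)
print(share)
```

In this toy example the single repeated guest is flagged as an outlier because the IQR of a mostly-zero binary column is 0. This is one reason flagged values can still be proper values that should not be treated.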
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with the count or percentage
    plt.show()  # show the plot
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
labeled_barplot(data, 'market_segment_type', perc=True)
labeled_barplot(data, 'booking_status', perc=True)
labeled_barplot(data, 'room_type_reserved', perc=True)
labeled_barplot(data, 'type_of_meal_plan', perc=True)
labeled_barplot(data, 'arrival_month', perc=True)
labeled_barplot(data, 'arrival_year', perc=True)
labeled_barplot(data, 'repeated_guest', perc=True)
labeled_barplot(data, 'required_car_parking_space', perc=True)
labeled_barplot(data, 'no_of_special_requests', perc=True)
labeled_barplot(data, 'no_of_week_nights', perc=True)
labeled_barplot(data, 'no_of_weekend_nights', perc=True)
labeled_barplot(data, 'no_of_previous_cancellations', perc=True)
num_cols = dataset.select_dtypes(include=np.number).columns
plt.figure(figsize=(12,7))
sns.heatmap(dataset[num_cols].corr(), annot=True, cmap='coolwarm', vmax=1, vmin=-1, fmt='.2f')
plt.show()
Observation after treating outliers: there is almost no change in the correlation between features
Both the cases are important as:
If we predict that a booking will not be canceled and the booking gets canceled then the hotel will lose resources and will have to bear additional costs of distribution channels.
If we predict that a booking will get canceled and the booking doesn't get canceled the hotel might not be able to provide satisfactory services to the customer by assuming that this booking will be canceled. This might damage the brand equity.
The F1 score should be maximized: the higher the F1 score, the better the chances of minimizing both False Negatives and False Positives.
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
data["booking_status"] = data["booking_status"].apply(lambda x: 1 if x == "Canceled" else 0)
X = data.drop(["booking_status"], axis=1)
y = data["booking_status"]
# adding constant
X = sm.add_constant(X)
X = pd.get_dummies(X, drop_first=True)
X = X.astype(int)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1
)
X_train.head()
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 1 | 1 | 0 | 0 | 1 | 0 | 163 | 2018 | 10 | 15 | 0 | 0 | 0 | 115 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 26641 | 1 | 2 | 0 | 0 | 3 | 0 | 113 | 2018 | 3 | 31 | 0 | 0 | 0 | 78 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17835 | 1 | 2 | 0 | 2 | 3 | 0 | 359 | 2018 | 10 | 14 | 0 | 0 | 0 | 78 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 21485 | 1 | 2 | 0 | 0 | 3 | 0 | 136 | 2018 | 6 | 29 | 0 | 0 | 0 | 85 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5670 | 1 | 2 | 0 | 1 | 2 | 0 | 21 | 2018 | 8 | 15 | 0 | 0 | 0 | 151 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
y_train.head()
| | booking_status |
|---|---|
| 13662 | 1 |
| 26641 | 0 |
| 17835 | 0 |
| 21485 | 1 |
| 5670 | 1 |
from sklearn.linear_model import LogisticRegression
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25364
Method: MLE Df Model: 27
Date: Fri, 04 Oct 2024 Pseudo R-squ.: 0.3293
Time: 23:37:27 Log-Likelihood: -10792.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -921.5462 120.827 -7.627 0.000 -1158.362 -684.730
no_of_adults 0.1145 0.038 3.040 0.002 0.041 0.188
no_of_children 0.1579 0.062 2.542 0.011 0.036 0.280
no_of_weekend_nights 0.1074 0.020 5.432 0.000 0.069 0.146
no_of_week_nights 0.0403 0.012 3.281 0.001 0.016 0.064
required_car_parking_space -1.5942 0.138 -11.565 0.000 -1.864 -1.324
lead_time 0.0157 0.000 58.867 0.000 0.015 0.016
arrival_year 0.4555 0.060 7.607 0.000 0.338 0.573
arrival_month -0.0417 0.006 -6.447 0.000 -0.054 -0.029
arrival_date 0.0005 0.002 0.261 0.794 -0.003 0.004
repeated_guest -2.3481 0.617 -3.808 0.000 -3.557 -1.140
no_of_previous_cancellations 0.2665 0.086 3.108 0.002 0.098 0.435
no_of_previous_bookings_not_canceled -0.1726 0.152 -1.132 0.258 -0.471 0.126
avg_price_per_room 0.0188 0.001 25.454 0.000 0.017 0.020
no_of_special_requests -1.4688 0.030 -48.781 0.000 -1.528 -1.410
type_of_meal_plan_Meal Plan 2 0.1748 0.067 2.623 0.009 0.044 0.305
type_of_meal_plan_Meal Plan 3 17.3532 3982.075 0.004 0.997 -7787.371 7822.077
type_of_meal_plan_Not Selected 0.2772 0.053 5.225 0.000 0.173 0.381
room_type_reserved_Room_Type 2 -0.3582 0.131 -2.729 0.006 -0.615 -0.101
room_type_reserved_Room_Type 3 -0.0002 1.310 -0.000 1.000 -2.567 2.567
room_type_reserved_Room_Type 4 -0.2824 0.053 -5.307 0.000 -0.387 -0.178
room_type_reserved_Room_Type 5 -0.7175 0.209 -3.431 0.001 -1.127 -0.308
room_type_reserved_Room_Type 6 -0.9501 0.151 -6.276 0.000 -1.247 -0.653
room_type_reserved_Room_Type 7 -1.4000 0.294 -4.770 0.000 -1.975 -0.825
market_segment_type_Complementary -40.5859 5.63e+05 -7.21e-05 1.000 -1.1e+06 1.1e+06
market_segment_type_Corporate -1.1911 0.266 -4.477 0.000 -1.713 -0.670
market_segment_type_Offline -2.1931 0.255 -8.613 0.000 -2.692 -1.694
market_segment_type_Online -0.3934 0.251 -1.566 0.117 -0.886 0.099
========================================================================================================
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train, y_train)
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80596 | 0.63422 | 0.73954 | 0.68285 |
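The F1 value in the table above is the harmonic mean of the precision and recall shown next to it. As a minimal sketch (the helper name `f1_from_precision_recall` is introduced here for illustration and is not part of the notebook), the relationship can be checked by hand using the two numbers copied from the table:

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the training performance table above
train_f1 = f1_from_precision_recall(precision=0.73954, recall=0.63422)
print(round(train_f1, 3))
```

Because the harmonic mean is pulled toward the smaller of the two inputs, F1 only rises when neither precision nor recall is sacrificed heavily for the other.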
# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
checking_vif(X_train).sort_values(by="VIF", ascending=False)
| | feature | VIF |
|---|---|---|
| 0 | const | 39490734.57350 |
| 27 | market_segment_type_Online | 71.17803 |
| 26 | market_segment_type_Offline | 64.11618 |
| 25 | market_segment_type_Corporate | 16.92839 |
| 24 | market_segment_type_Complementary | 4.50251 |
| 2 | no_of_children | 2.09343 |
| 13 | avg_price_per_room | 2.06016 |
| 22 | room_type_reserved_Room_Type 6 | 2.05560 |
| 10 | repeated_guest | 1.78355 |
| 12 | no_of_previous_bookings_not_canceled | 1.65200 |
| 7 | arrival_year | 1.43165 |
| 11 | no_of_previous_cancellations | 1.39569 |
| 6 | lead_time | 1.39537 |
| 20 | room_type_reserved_Room_Type 4 | 1.36326 |
| 1 | no_of_adults | 1.35066 |
| 8 | arrival_month | 1.27625 |
| 15 | type_of_meal_plan_Meal Plan 2 | 1.27330 |
| 17 | type_of_meal_plan_Not Selected | 1.27270 |
| 14 | no_of_special_requests | 1.24798 |
| 23 | room_type_reserved_Room_Type 7 | 1.11801 |
| 18 | room_type_reserved_Room_Type 2 | 1.10605 |
| 4 | no_of_week_nights | 1.09595 |
| 3 | no_of_weekend_nights | 1.06974 |
| 5 | required_car_parking_space | 1.03998 |
| 21 | room_type_reserved_Room_Type 5 | 1.02791 |
| 16 | type_of_meal_plan_Meal Plan 3 | 1.02526 |
| 9 | arrival_date | 1.00679 |
| 19 | room_type_reserved_Room_Type 3 | 1.00330 |
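To make the VIF values in the table above easier to read, here is a rough sketch of what the statistic measures: each feature is regressed on the remaining features, and VIF = 1 / (1 - R²) of that regression. The function `vif_for_column` and the synthetic data are assumptions made for illustration. Unlike the notebook's call to `variance_inflation_factor`, this sketch adds its own intercept instead of relying on a `const` column in the design matrix.

```python
import numpy as np


def vif_for_column(X: np.ndarray, i: int) -> float:
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)


# Synthetic data for illustration only: c is nearly a linear combination of a and b
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)
X_demo = np.column_stack([a, b, c])
vifs = [vif_for_column(X_demo, i) for i in range(X_demo.shape[1])]
print(vifs)
```

A VIF close to 1 means the other features explain almost none of that column's variance. Large values, as in the synthetic example, mean the column is close to redundant given the rest.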
Observation:
Variables with a high p-value can be dropped manually: pick the variable with the highest p-value, drop it, and build the model again, repeating until every remaining p-value is acceptable. That is tedious, so a loop is more efficient.
# initial list of columns
cols = X_train.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    x_train_aux = X_train.astype(float)[cols]
    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25370
Method: MLE Df Model: 21
Date: Fri, 04 Oct 2024 Pseudo R-squ.: 0.3283
Time: 23:37:31 Log-Likelihood: -10809.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -914.2561 120.467 -7.589 0.000 -1150.368 -678.145
no_of_adults 0.1097 0.037 2.939 0.003 0.037 0.183
no_of_children 0.1530 0.062 2.468 0.014 0.031 0.275
no_of_weekend_nights 0.1093 0.020 5.535 0.000 0.071 0.148
no_of_week_nights 0.0423 0.012 3.445 0.001 0.018 0.066
required_car_parking_space -1.5947 0.138 -11.565 0.000 -1.865 -1.324
lead_time 0.0157 0.000 59.218 0.000 0.015 0.016
arrival_year 0.4516 0.060 7.565 0.000 0.335 0.569
arrival_month -0.0426 0.006 -6.598 0.000 -0.055 -0.030
repeated_guest -2.7378 0.557 -4.917 0.000 -3.829 -1.647
no_of_previous_cancellations 0.2289 0.077 2.982 0.003 0.078 0.379
avg_price_per_room 0.0192 0.001 26.393 0.000 0.018 0.021
no_of_special_requests -1.4697 0.030 -48.881 0.000 -1.529 -1.411
type_of_meal_plan_Meal Plan 2 0.1634 0.067 2.455 0.014 0.033 0.294
type_of_meal_plan_Not Selected 0.2848 0.053 5.385 0.000 0.181 0.388
room_type_reserved_Room_Type 2 -0.3528 0.131 -2.690 0.007 -0.610 -0.096
room_type_reserved_Room_Type 4 -0.2831 0.053 -5.334 0.000 -0.387 -0.179
room_type_reserved_Room_Type 5 -0.7350 0.208 -3.527 0.000 -1.143 -0.327
room_type_reserved_Room_Type 6 -0.9682 0.151 -6.404 0.000 -1.265 -0.672
room_type_reserved_Room_Type 7 -1.4341 0.293 -4.893 0.000 -2.009 -0.860
market_segment_type_Corporate -0.7960 0.103 -7.738 0.000 -0.998 -0.594
market_segment_type_Offline -1.7901 0.052 -34.451 0.000 -1.892 -1.688
==================================================================================================
print("Training performance:")
model_performance_classification_statsmodels(lg1, X_train1.astype(float), y_train.astype(float))
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80557 | 0.63303 | 0.73918 | 0.68200 |
Coefficient interpretations:
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1.astype(float), y_train.astype(float))
print("Training performance:")
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg1, X_train1.astype(float), y_train.astype(float)
)
log_reg_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80557 | 0.63303 | 0.73918 | 0.68200 |
Let's check the performance on the test set
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1.astype(float), y_test.astype(float))
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg1, X_test1.astype(float), y_test.astype(float)
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80447 | 0.63118 | 0.72837 | 0.67630 |
Generalization: The metrics are very close between the training and test sets (F1 of about 0.68 on both), which is a positive indicator of the model's generalization capability. There is no significant drop in performance on the test set, so the model does not appear to be overfitting the training data.
Precision and Recall Trade-off: Precision and recall are both slightly lower on the test set, but the relationship between them is unchanged. At the default threshold, precision (about 0.73) stays higher than recall (about 0.63), so the model misses more actual cancellations than it raises false alarms.
# converting coefficients to odds
odds = np.exp(lg1.params)
# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
| | const | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.00000 | 1.11594 | 1.16534 | 1.11553 | 1.04319 | 0.20297 | 1.01584 | 1.57088 | 0.95834 | 0.06471 | 1.25716 | 1.01938 | 0.23000 | 1.17746 | 1.32950 | 0.70275 | 0.75348 | 0.47952 | 0.37976 | 0.23833 | 0.45111 | 0.16695 |
| Change_odd% | -100.00000 | 11.59416 | 16.53413 | 11.55280 | 4.31861 | -79.70273 | 1.58437 | 57.08751 | -4.16607 | -93.52885 | 25.71559 | 1.93804 | -77.00048 | 17.74556 | 32.94984 | -29.72500 | -24.65213 | -52.04826 | -62.02421 | -76.16674 | -54.88875 | -83.30520 |
Coefficient interpretations:
Interpretation for other attributes can be done similarly.
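As a worked illustration of how an entry in the odds table is read, the sketch below converts three coefficients (copied, rounded, from the `lg1` summary above) into an odds ratio and a percentage change in odds. The helper `odds_summary` is a name introduced here for illustration only.

```python
import math

# Coefficients copied (rounded to 4 decimals) from the lg1 summary above
coefficients = {
    "lead_time": 0.0157,
    "no_of_special_requests": -1.4697,
    "repeated_guest": -2.7378,
}


def odds_summary(coef: float):
    """Return (odds ratio, percentage change in odds) for a one-unit increase."""
    odds_ratio = math.exp(coef)
    pct_change = (odds_ratio - 1) * 100
    return odds_ratio, pct_change


for name, coef in coefficients.items():
    odds_ratio, pct_change = odds_summary(coef)
    print(f"{name}: odds ratio {odds_ratio:.4f}, change in odds {pct_change:+.2f}%")
```

Read this way, and holding the other features constant, each extra day of lead time multiplies the odds of cancellation by roughly 1.016 (about +1.6%), while each additional special request multiplies them by roughly 0.23 (about -77%). Small differences from the table come from using rounded coefficients.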
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.372598551491902
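The `np.argmax(tpr - fpr)` step above picks the threshold with the largest gap between true positive rate and false positive rate (often called Youden's J statistic). The following is a minimal pure-Python sketch of the same idea on a small, hypothetical set of labels and scores. The function name `youden_threshold` and the demo data are assumptions for illustration, and predicting class 1 at or above the threshold is a choice made for this sketch.

```python
def youden_threshold(y_true, scores):
    """Return the (threshold, J) pair that maximises J = TPR - FPR."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(scores), reverse=True):
        # Predict class 1 when the score is at or above the candidate threshold
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical labels and predicted probabilities, for illustration only
y_demo = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
p_demo = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
threshold, j_stat = youden_threshold(y_demo, p_demo)
print(threshold, j_stat)
```

This criterion weights true positives and false positives equally, which tends to favor recall when the positive class is the minority. That matches the shift from the default 0.5 toward a lower threshold seen here.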
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79289 | 0.73419 | 0.66914 | 0.70015 |
Let's check the performance on the test set
logit_roc_auc_test = roc_auc_score(y_test.astype(float), lg1.predict(X_test1.astype(float)))
fpr, tpr, thresholds = roc_curve(y_test.astype(float), lg1.predict(X_test1.astype(float)))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1.astype(float), y_test.astype(float), threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_test1.astype(float), y_test.astype(float), threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79629 | 0.73822 | 0.66752 | 0.70109 |
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.42
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80139 | 0.69891 | 0.69833 | 0.69862 |
Let's check the performance on the test set
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1.astype(float), y_test.astype(float), threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_test1.astype(float), y_test.astype(float), threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80327 | 0.70216 | 0.69369 | 0.69790 |
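The 0.42 threshold used above is where the precision and recall curves meet on the plot. As a hedged sketch of how such a crossover could be located programmatically rather than by eye, the pure-Python example below scans a grid of candidate thresholds and keeps the one with the smallest gap between precision and recall. The function names, the demo data, and the candidate grid are all assumptions for illustration, not the notebook's method.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores strictly above the threshold are labelled 1."""
    tp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s <= threshold and y == 1)
    # Convention for this sketch: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def crossover_threshold(y_true, scores, candidates):
    """Candidate threshold with the smallest gap between precision and recall."""
    def gap(t):
        precision, recall = precision_recall_at(y_true, scores, t)
        return abs(precision - recall)
    return min(candidates, key=gap)


# Hypothetical labels and predicted probabilities, for illustration only
y_demo = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
p_demo = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
grid = [i / 100 for i in range(5, 96, 5)]
best = crossover_threshold(y_demo, p_demo, grid)
print(best, precision_recall_at(y_demo, p_demo, best))
```

The strict `>` comparison mirrors the way the notebook's helper functions apply a threshold to the predicted probabilities.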
Model performance summary
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80557 | 0.79289 | 0.80139 |
| Recall | 0.63303 | 0.73419 | 0.69891 |
| Precision | 0.73918 | 0.66914 | 0.69833 |
| F1 | 0.68200 | 0.70015 | 0.69862 |
# test performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80447 | 0.79629 | 0.80327 |
| Recall | 0.63118 | 0.73822 | 0.70216 |
| Precision | 0.72837 | 0.66752 | 0.69369 |
| F1 | 0.67630 | 0.70109 | 0.69790 |
We have built a predictive model that the hotel can use to predict which bookings are likely to be canceled, with an F1 score of about 0.70 on both the training and test sets, and to formulate marketing policies accordingly.
The logistic regression models generalize well: performance on the training and test sets is nearly identical.
With the default threshold (0.5), the model gives lower recall but higher precision. The hotel will be able to identify bookings that will not be canceled and provide satisfactory service to those customers, which helps maintain brand equity, but it will lose resources on the cancellations the model misses.
With the 0.37 threshold, the model gives higher recall but lower precision. The hotel will save resources by correctly predicting more of the bookings that are likely to be canceled, but the additional false alarms might damage brand equity.
With the 0.42 threshold, the model gives balanced recall and precision. The hotel will be able to maintain a balance between resources and brand equity.
The coefficients of required_car_parking_space, arrival_month, repeated_guest, no_of_special_requests and some others are negative: an increase in these leads to a decrease in the chances of a customer canceling their booking.
The coefficients of no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, lead_time, avg_price_per_room, type_of_meal_plan_Not Selected and some others are positive: an increase in these leads to an increase in the chances of a customer canceling their booking.
Using model with default threshold
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(lg1, X_test1, y_test)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80447 | 0.63118 | 0.72837 | 0.67630 |
Using model with threshold = 0.37
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.79629 | 0.73822 | 0.66752 | 0.70109 |
Using model with threshold = 0.42
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_test1, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.80327 | 0.70216 | 0.69369 | 0.69790 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80557 | 0.79289 | 0.80139 |
| Recall | 0.63303 | 0.73419 | 0.69891 |
| Precision | 0.73918 | 0.66914 | 0.69833 |
| F1 | 0.68200 | 0.70015 | 0.69862 |
# test performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Accuracy | 0.80447 | 0.79629 | 0.80327 |
| Recall | 0.63118 | 0.73822 | 0.70216 |
| Precision | 0.72837 | 0.66752 | 0.69369 |
| F1 | 0.67630 | 0.70109 | 0.69790 |
X = data.drop('booking_status', axis=1)
y = data['booking_status']
X = pd.get_dummies(X, drop_first=True)
X = X.astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 27)
Shape of test set :  (10883, 27)
Percentage of classes in training set:
booking_status
0    0.67238
1    0.32762
Name: proportion, dtype: float64
Percentage of classes in test set:
booking_status
0    0.67233
1    0.32767
Name: proportion, dtype: float64
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
dTree = DecisionTreeClassifier(random_state=1)
dTree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
print('Training data Accuracy:', dTree.score(X_train, y_train))
print('Test data Accuracy:', dTree.score(X_test, y_test))
Training data Accuracy: 0.9943683049779458
Test data Accuracy: 0.8635486538638243
## Function to create confusion matrix
from sklearn import metrics
def make_confusion_matrix(model, y_actual, labels=(0, 1)):
    '''
    model : classifier to predict values of X_test (read from the notebook's global scope)
    y_actual : ground truth for X_test
    labels : class order used for the rows and columns of the matrix
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=list(labels))
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    # Cell annotations: raw count on the first line, share of all samples on the second
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot = np.asarray(annot).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
## Function to calculate recall score
def get_recall_score(model):
    '''
    model : classifier to predict values of X
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ", metrics.recall_score(y_train, pred_train))
    print("Recall on test set : ", metrics.recall_score(y_test, pred_test))
make_confusion_matrix(dTree,y_test)
decision_tree_perf_train = model_performance_classification_sklearn(
dTree, X_train, y_train
)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99437 | 0.98570 | 0.99708 | 0.99136 |
confusion_matrix_sklearn(dTree, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(dTree, X_test, y_test)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.86355 | 0.79641 | 0.78911 | 0.79274 |
Before pruning the tree, let's check which features are most important.
feature_names = list(X_train.columns)
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Pre-Pruning
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 8, 2),  # depths 2, 4 and 6
    "max_leaf_nodes": [50, 100, 150, 250, 500],  # cap on the number of leaves
    "min_samples_split": [10, 30, 50, 100],  # minimum samples needed to split a node
    "min_samples_leaf": [1, 5, 10, 20],  # minimum samples per leaf; smaller values give finer trees
    "criterion": ["gini", "entropy"],  # impurity measure used to choose splits
    # "max_features": [None, "sqrt", "log2"],  # number of features to consider at each split
    "min_impurity_decrease": [0.0, 0.01, 0.05],  # minimum impurity reduction required for a split
}
# Type of scoring used to compare parameter combinations (F1, despite the variable name)
acc_scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
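An exhaustive grid search fits one model per parameter combination per fold, so it is worth counting the fits before running it. The sketch below does that with the standard library only; the candidate values are copied from the grid above, with `np.arange(2, 8, 2)` written out as `[2, 4, 6]`.

```python
from math import prod

# Candidate values copied from the grid above (max_depth written out as a list)
param_grid = {
    "max_depth": [2, 4, 6],
    "max_leaf_nodes": [50, 100, 150, 250, 500],
    "min_samples_split": [10, 30, 50, 100],
    "min_samples_leaf": [1, 5, 10, 20],
    "criterion": ["gini", "entropy"],
    "min_impurity_decrease": [0.0, 0.01, 0.05],
}
cv_folds = 5

# Number of distinct parameter combinations in the grid
n_combinations = prod(len(values) for values in param_grid.values())
# Each combination is fitted once per cross-validation fold
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)
```

With its default `refit=True`, `GridSearchCV` also refits the best combination on the full training set, so `best_estimator_` is already fitted when it is retrieved.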
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
confusion_matrix_sklearn(estimator, X_train, y_train)
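# Evaluate the tuned estimator; the original call passed dTree, which is why the
# table below repeats the unpruned tree's training scores.
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train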
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.99437 | 0.98570 | 0.99708 | 0.99136 |
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.83451 | 0.77987 | 0.73242 | 0.75540 |
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |--- avg_price_per_room <= 202.00
|   |   |   |   |   |   |--- weights: [2356.56, 293.02] class: 0
|   |   |   |   |   |--- avg_price_per_room > 202.00
|   |   |   |   |   |   |--- weights: [1.49, 21.37] class: 1
|   |   |   |   |--- no_of_weekend_nights > 1.50
|   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |--- weights: [104.11, 157.19] class: 1
|   |   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |   |--- weights: [350.25, 83.94] class: 0
|   |   |   |--- lead_time > 90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- arrival_month <= 10.50
|   |   |   |   |   |   |--- weights: [245.40, 512.78] class: 1
|   |   |   |   |   |--- arrival_month > 10.50
|   |   |   |   |   |   |--- weights: [43.13, 4.58] class: 0
|   |   |   |   |--- lead_time > 117.50
|   |   |   |   |   |--- avg_price_per_room <= 89.50
|   |   |   |   |   |   |--- weights: [169.55, 27.47] class: 0
|   |   |   |   |   |--- avg_price_per_room > 89.50
|   |   |   |   |   |   |--- weights: [124.93, 108.36] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- lead_time <= 9.50
|   |   |   |   |--- lead_time <= 2.50
|   |   |   |   |   |--- avg_price_per_room <= 202.50
|   |   |   |   |   |   |--- weights: [408.25, 93.09] class: 0
|   |   |   |   |   |--- avg_price_per_room > 202.50
|   |   |   |   |   |   |--- weights: [0.74, 16.79] class: 1
|   |   |   |   |--- lead_time > 2.50
|   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |--- weights: [171.03, 228.92] class: 1
|   |   |   |   |   |--- arrival_month > 9.50
|   |   |   |   |   |   |--- weights: [121.96, 21.37] class: 0
|   |   |   |--- lead_time > 9.50
|   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |--- lead_time <= 60.50
|   |   |   |   |   |   |--- weights: [145.75, 27.47] class: 0
|   |   |   |   |   |--- lead_time > 60.50
|   |   |   |   |   |   |--- weights: [37.93, 144.98] class: 1
|   |   |   |   |--- arrival_year > 2017.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [944.41, 3645.96] class: 1
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [43.87, 3.05] class: 0
|   |--- no_of_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 100.50
|   |   |   |   |   |--- no_of_week_nights <= 11.00
|   |   |   |   |   |   |--- weights: [699.01, 9.16] class: 0
|   |   |   |   |   |--- no_of_week_nights > 11.00
|   |   |   |   |   |   |--- weights: [0.00, 1.53] class: 1
|   |   |   |   |--- lead_time > 100.50
|   |   |   |   |   |--- lead_time <= 105.00
|   |   |   |   |   |   |--- weights: [5.21, 6.10] class: 1
|   |   |   |   |   |--- lead_time > 105.00
|   |   |   |   |   |   |--- weights: [78.08, 13.74] class: 0
|   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |--- lead_time <= 6.50
|   |   |   |   |   |--- weights: [661.83, 61.05] class: 0
|   |   |   |   |--- lead_time > 6.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2618.32, 1460.52] class: 0
|   |   |   |   |   |--- required_car_parking_space > 0.50
|   |   |   |   |   |   |--- weights: [137.57, 1.53] class: 0
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- lead_time <= 89.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1593.60, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights > 3.50
|   |   |   |   |   |--- no_of_week_nights <= 9.50
|   |   |   |   |   |   |--- weights: [235.73, 62.57] class: 0
|   |   |   |   |   |--- no_of_week_nights > 9.50
|   |   |   |   |   |   |--- weights: [0.00, 6.10] class: 1
|   |   |   |--- lead_time > 89.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [180.70, 51.89] class: 0
|   |   |   |   |   |--- arrival_month > 8.50
|   |   |   |   |   |   |--- weights: [118.24, 99.20] class: 0
|   |   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |   |--- weights: [69.16, 0.00] class: 0
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.50
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [5.21, 32.05] class: 1
|   |   |   |   |   |--- lead_time > 163.50
|   |   |   |   |   |   |--- weights: [254.32, 68.68] class: 0
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [5.95, 4.58] class: 0
|   |   |   |   |   |--- avg_price_per_room > 2.50
|   |   |   |   |   |   |--- weights: [1.49, 90.04] class: 1
|   |   |   |--- no_of_adults > 1.50
|   |   |   |   |--- avg_price_per_room <= 82.50
|   |   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |   |--- weights: [216.40, 399.85] class: 1
|   |   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |   |--- weights: [2.97, 312.86] class: 1
|   |   |   |   |--- avg_price_per_room > 82.50
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [20.08, 1106.46] class: 1
|   |   |   |   |   |--- no_of_adults > 2.50
|   |   |   |   |   |   |--- weights: [4.46, 0.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- arrival_date <= 30.50
|   |   |   |   |   |   |--- weights: [46.11, 12.21] class: 0
|   |   |   |   |   |--- arrival_date > 30.50
|   |   |   |   |   |   |--- weights: [0.00, 3.05] class: 1
|   |   |   |   |--- lead_time > 180.50
|   |   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |   |--- weights: [14.87, 4.58] class: 0
|   |   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |   |--- weights: [17.85, 204.50] class: 1
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- no_of_special_requests <= 1.50
|   |   |   |   |   |   |--- weights: [107.83, 1.53] class: 0
|   |   |   |   |   |--- no_of_special_requests > 1.50
|   |   |   |   |   |   |--- weights: [8.92, 6.10] class: 0
|   |   |   |   |--- market_segment_type_Online > 0.50
|   |   |   |   |   |--- avg_price_per_room <= 99.50
|   |   |   |   |   |   |--- weights: [248.37, 137.35] class: 0
|   |   |   |   |   |--- avg_price_per_room > 99.50
|   |   |   |   |   |   |--- weights: [0.00, 19.84] class: 1
|   |--- avg_price_per_room > 100.50
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3101.13] class: 1
|   |   |   |--- no_of_special_requests > 2.50
|   |   |   |   |--- weights: [24.54, 0.00] class: 0
|   |   |--- arrival_month > 11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [40.16, 0.00] class: 0
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- arrival_date <= 8.00
|   |   |   |   |   |--- weights: [2.97, 0.00] class: 0
|   |   |   |   |--- arrival_date > 8.00
|   |   |   |   |   |--- weights: [6.69, 27.47] class: 1
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations from the decision tree:
* We can see that the tree has become simpler and its rules are readable.
The rules obtained from the decision tree can be interpreted as:
Bookings made more than 151 days before the date of arrival:
Bookings made under 151 days before the date of arrival:
If we want more complex rules, we can follow the tree to a greater depth.
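Each leaf in the printed rules carries a `weights` pair: the class-weighted totals of non-canceled (class 0) and canceled (class 1) training bookings that reach it. The leaf predicts whichever class has the larger total. As a rough sketch of how to read a leaf (the helper `summarize_leaf` is hypothetical; the two weight pairs are copied from the printed tree):

```python
def summarize_leaf(weights):
    """Return the predicted class and its share of the leaf's weighted total."""
    total = sum(weights)
    # The predicted class is the index of the largest weighted total
    predicted = max(range(len(weights)), key=lambda i: weights[i])
    share = weights[predicted] / total if total else 0.0
    return predicted, share


# Online bookings, no special requests, lead_time between 9.5 and 151.5,
# arrival_year > 2017.5, no parking space requested
leaf_a = summarize_leaf([944.41, 3645.96])
# Non-online bookings, no special requests, lead_time <= 90.5,
# at most one weekend night, avg_price_per_room <= 202
leaf_b = summarize_leaf([2356.56, 293.02])
print(leaf_a, leaf_b)
```

Because the tree was fitted with `class_weight="balanced"`, these shares describe weighted totals, not raw proportions of bookings, so they overstate the cancellation rate in the minority class.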
Total impurity of leaves vs effective alphas of pruned tree
Cost Complexity Pruning
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00000 | 0.00833 |
| 1 | -0.00000 | 0.00833 |
| 2 | 0.00000 | 0.00833 |
| 3 | 0.00000 | 0.00833 |
| 4 | 0.00000 | 0.00833 |
| ... | ... | ... |
| 1636 | 0.00880 | 0.32791 |
| 1637 | 0.00941 | 0.33732 |
| 1638 | 0.01253 | 0.34985 |
| 1639 | 0.03405 | 0.41794 |
| 1640 | 0.08206 | 0.50000 |
1641 rows × 2 columns
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train one decision tree for each effective alpha. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
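Cost-complexity pruning penalizes a tree's total leaf impurity by alpha times its number of leaves, so a larger alpha favors a smaller tree. The sketch below illustrates this on synthetic data from `make_classification`, used here as a stand-in for the booking data; the names `X_demo`, `y_demo` and `tree_demo` are placeholders introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the hotel bookings)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# Effective alphas at which the fully grown tree loses its weakest subtree
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_demo, y_demo)
alphas = path.ccp_alphas

node_counts = []
for alpha in alphas:
    # Guard against tiny negative values caused by floating-point error
    tree_demo = DecisionTreeClassifier(random_state=1, ccp_alpha=max(alpha, 0.0))
    tree_demo.fit(X_demo, y_demo)
    node_counts.append(tree_demo.tree_.node_count)

print("smallest alpha:", alphas[0], "->", node_counts[0], "nodes")
print("largest alpha:", alphas[-1], "->", node_counts[-1], "nodes")
```

The tree fitted with the smallest alpha is the unpruned tree, and the node count shrinks as alpha grows, which is why the last, fully pruned tree is usually dropped before comparing scores.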
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 3 with ccp_alpha: 0.03404622796578144
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)
f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=9.207759232287106e-05, class_weight='balanced',
random_state=1)
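`np.argmax` returns the first index of the maximum, and the alphas are in ascending order, so any tie is resolved in favor of the smallest alpha, that is, the largest tree. One possible alternative, sketched here with a hypothetical helper `pick_simplest_alpha` and made-up scores, is to take the largest alpha whose score lies within a small tolerance of the best:

```python
def pick_simplest_alpha(alphas, scores, tolerance=0.0):
    """Return the index of the largest alpha whose score is within
    `tolerance` of the best score (alphas assumed sorted ascending)."""
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if s >= best - tolerance]
    return candidates[-1]


# Hypothetical alphas and F1 scores, for illustration only
demo_alphas = [0.0, 0.0001, 0.0005, 0.001, 0.01]
demo_f1 = [0.790, 0.798, 0.797, 0.780, 0.600]

idx_exact = pick_simplest_alpha(demo_alphas, demo_f1)
idx_tolerant = pick_simplest_alpha(demo_alphas, demo_f1, tolerance=0.002)
print(demo_alphas[idx_exact], demo_alphas[idx_tolerant])
```

Note that picking alpha by test-set F1 makes the reported test scores somewhat optimistic; a validation split or cross-validation on the training data is a common way to avoid that.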
confusion_matrix_sklearn(best_model, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.91714 | 0.93148 | 0.83475 | 0.88047 |
confusion_matrix_sklearn(best_model, X_test, y_test)
decision_tree_post_perf_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.86033 | 0.84212 | 0.75833 | 0.79803 |
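To compare the three trees side by side, the scores can be collected in one structure. The sketch below copies the figures from the performance tables in this section; the model labels are placeholders. The pre-pruned tree is left out of the train/test gap because its training table above repeats the default tree's figures.

```python
# Test-set scores copied from the performance tables in this section
test_scores = {
    "Decision Tree (default)": {
        "Accuracy": 0.86355, "Recall": 0.79641, "Precision": 0.78911, "F1": 0.79274,
    },
    "Decision Tree (pre-pruned)": {
        "Accuracy": 0.83451, "Recall": 0.77987, "Precision": 0.73242, "F1": 0.75540,
    },
    "Decision Tree (post-pruned)": {
        "Accuracy": 0.86033, "Recall": 0.84212, "Precision": 0.75833, "F1": 0.79803,
    },
}
# Training F1 for the two models whose training tables are reliable
train_f1 = {
    "Decision Tree (default)": 0.99136,
    "Decision Tree (post-pruned)": 0.88047,
}

# Model with the highest test F1
best_by_f1 = max(test_scores, key=lambda name: test_scores[name]["F1"])
# Train-minus-test F1 gap, a rough indicator of overfitting
f1_gap = {name: round(train_f1[name] - test_scores[name]["F1"], 5) for name in train_f1}
print(best_by_f1)
print(f1_gap)
```

On these figures the post-pruned tree has the highest test F1 and recall, and a much smaller train/test gap than the default tree.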
Observations:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50 | |--- no_of_special_requests <= 0.50 | | |--- market_segment_type_Online <= 0.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_weekend_nights <= 1.50 | | | | | |--- avg_price_per_room <= 202.00 | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | |--- repeated_guest <= 0.50 | | | | | | | | | |--- avg_price_per_room <= 87.00 | | | | | | | | | | |--- market_segment_type_Corporate <= 0.50 | | | | | | | | | | | |--- weights: [54.29, 0.00] class: 0 | | | | | | | | | | |--- market_segment_type_Corporate > 0.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- avg_price_per_room > 87.00 | | | | | | | | | | |--- lead_time <= 1.50 | | | | | | | | | | | |--- weights: [38.67, 1.53] class: 0 | | | | | | | | | | |--- lead_time > 1.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- repeated_guest > 0.50 | | | | | | | | | |--- weights: [130.88, 0.00] class: 0 | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | |--- weights: [1226.99, 1.53] class: 0 | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | |--- lead_time <= 65.50 | | | | | | | | |--- avg_price_per_room <= 115.50 | | | | | | | | | |--- avg_price_per_room <= 48.50 | | | | | | | | | | |--- avg_price_per_room <= 46.50 | | | | | | | | | | | |--- weights: [23.05, 1.53] class: 0 | | | | | | | | | | |--- avg_price_per_room > 46.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- avg_price_per_room > 48.50 | | | | | | | | | | |--- avg_price_per_room <= 67.50 | | | | | | | | | | | |--- weights: [162.86, 7.63] class: 0 | | | | | | | | | | |--- avg_price_per_room > 67.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | |--- avg_price_per_room > 115.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- weights: [20.08, 1.53] class: 0 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | 
| |--- avg_price_per_room <= 121.00 | | | | | | | | | | | |--- weights: [2.23, 10.68] class: 1 | | | | | | | | | | |--- avg_price_per_room > 121.00 | | | | | | | | | | | |--- weights: [14.13, 6.10] class: 0 | | | | | | | |--- lead_time > 65.50 | | | | | | | | |--- lead_time <= 78.50 | | | | | | | | | |--- avg_price_per_room <= 68.50 | | | | | | | | | | |--- weights: [6.69, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 68.50 | | | | | | | | | | |--- lead_time <= 69.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 69.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- lead_time > 78.50 | | | | | | | | | |--- weights: [46.85, 6.10] class: 0 | | | | | |--- avg_price_per_room > 202.00 | | | | | | |--- arrival_date <= 26.00 | | | | | | | |--- weights: [0.00, 21.37] class: 1 | | | | | | |--- arrival_date > 26.00 | | | | | | | |--- weights: [1.49, 0.00] class: 0 | | | | |--- no_of_weekend_nights > 1.50 | | | | | |--- no_of_adults <= 1.50 | | | | | | |--- arrival_month <= 6.50 | | | | | | | |--- market_segment_type_Corporate <= 0.50 | | | | | | | | |--- arrival_date <= 8.50 | | | | | | | | | |--- weights: [6.69, 0.00] class: 0 | | | | | | | | |--- arrival_date > 8.50 | | | | | | | | | |--- avg_price_per_room <= 57.50 | | | | | | | | | | |--- weights: [3.72, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 57.50 | | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | | |--- weights: [0.74, 0.00] class: 0 | | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | |--- market_segment_type_Corporate > 0.50 | | | | | | | | |--- avg_price_per_room <= 120.00 | | | | | | | | | |--- lead_time <= 46.00 | | | | | | | | | | |--- weights: [12.64, 0.00] class: 0 | | | | | | | | | |--- lead_time > 46.00 | | | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | | | | |--- avg_price_per_room > 120.00 | | | | | | | | | |--- 
weights: [0.00, 3.05] class: 1 | | | | | | |--- arrival_month > 6.50 | | | | | | | |--- no_of_weekend_nights <= 5.00 | | | | | | | | |--- weights: [72.13, 4.58] class: 0 | | | | | | | |--- no_of_weekend_nights > 5.00 | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | |--- no_of_adults > 1.50 | | | | | | |--- lead_time <= 59.50 | | | | | | | |--- avg_price_per_room <= 32.00 | | | | | | | | |--- weights: [0.74, 3.05] class: 1 | | | | | | | |--- avg_price_per_room > 32.00 | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | |--- lead_time <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 89.00 | | | | | | | | | | | |--- weights: [3.72, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 89.00 | | | | | | | | | | | |--- weights: [0.00, 3.05] class: 1 | | | | | | | | | |--- lead_time > 0.50 | | | | | | | | | | |--- weights: [232.76, 12.21] class: 0 | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [0.74, 3.05] class: 1 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- weights: [29.75, 6.10] class: 0 | | | | | | |--- lead_time > 59.50 | | | | | | | |--- arrival_date <= 29.50 | | | | | | | | |--- lead_time <= 60.50 | | | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | | | |--- weights: [0.00, 16.79] class: 1 | | | | | | | | | |--- arrival_month > 3.50 | | | | | | | | | | |--- weights: [3.72, 0.00] class: 0 | | | | | | | | |--- lead_time > 60.50 | | | | | | | | | |--- arrival_date <= 24.50 | | | | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | | | | |--- weights: [56.52, 4.58] class: 0 | | | | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_date > 24.50 | | | | | | | | | | |--- lead_time <= 79.00 | | | | | | | | | | | |--- weights: [10.41, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 79.00 | | | | | | | | | | | 
|--- truncated branch of depth 2 | | | | | | | |--- arrival_date > 29.50 | | | | | | | | |--- weights: [0.74, 19.84] class: 1 | | | |--- lead_time > 90.50 | | | | |--- lead_time <= 117.50 | | | | | |--- arrival_month <= 10.50 | | | | | | |--- avg_price_per_room <= 90.50 | | | | | | | |--- avg_price_per_room <= 75.50 | | | | | | | | |--- lead_time <= 98.50 | | | | | | | | | |--- weights: [22.31, 4.58] class: 0 | | | | | | | | |--- lead_time > 98.50 | | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | | |--- avg_price_per_room <= 58.50 | | | | | | | | | | | |--- weights: [5.95, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 58.50 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | | |--- arrival_date <= 22.50 | | | | | | | | | | | |--- weights: [17.85, 4.58] class: 0 | | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | | |--- weights: [3.72, 12.21] class: 1 | | | | | | | |--- avg_price_per_room > 75.50 | | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | | |--- weights: [52.05, 3.05] class: 0 | | | | | | | | |--- arrival_month > 3.50 | | | | | | | | | |--- arrival_month <= 4.50 | | | | | | | | | | |--- lead_time <= 103.00 | | | | | | | | | | | |--- weights: [0.00, 15.26] class: 1 | | | | | | | | | | |--- lead_time > 103.00 | | | | | | | | | | | |--- weights: [1.49, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 4.50 | | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | |--- avg_price_per_room > 90.50 | | | | | | | |--- arrival_date <= 16.50 | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | |--- avg_price_per_room <= 95.50 | | | | | | | | | | |--- weights: [5.95, 12.21] class: 1 | | | | | | | | | |--- avg_price_per_room > 95.50 | | | | | | | | | | |--- avg_price_per_room <= 
108.00 | | | | | | | | | | | |--- weights: [14.13, 1.53] class: 0 | | | | | | | | | | |--- avg_price_per_room > 108.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | |--- avg_price_per_room <= 108.50 | | | | | | | | | | |--- lead_time <= 97.00 | | | | | | | | | | | |--- weights: [1.49, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 97.00 | | | | | | | | | | | |--- weights: [2.97, 76.31] class: 1 | | | | | | | | | |--- avg_price_per_room > 108.50 | | | | | | | | | | |--- avg_price_per_room <= 109.50 | | | | | | | | | | | |--- weights: [29.00, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 109.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- arrival_date > 16.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | |--- lead_time <= 91.50 | | | | | | | | | | |--- weights: [1.49, 0.00] class: 0 | | | | | | | | | |--- lead_time > 91.50 | | | | | | | | | | |--- weights: [8.92, 144.98] class: 1 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- weights: [1.49, 0.00] class: 0 | | | | | |--- arrival_month > 10.50 | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | |--- weights: [37.93, 0.00] class: 0 | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | |--- weights: [5.21, 4.58] class: 0 | | | | |--- lead_time > 117.50 | | | | | |--- avg_price_per_room <= 89.50 | | | | | | |--- lead_time <= 126.50 | | | | | | | |--- weights: [33.46, 13.74] class: 0 | | | | | | |--- lead_time > 126.50 | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | |--- weights: [134.60, 10.68] class: 0 | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | |--- weights: [1.49, 3.05] class: 1 | | | | | |--- avg_price_per_room > 89.50 | | | | | | |--- arrival_date <= 7.50 | | | | | | | |--- avg_price_per_room <= 115.00 | | | | | | | | |--- weights: [58.00, 1.53] class: 0 | | | | | | | 
|--- avg_price_per_room > 115.00 | | | | | | | | |--- weights: [3.72, 4.58] class: 1 | | | | | | |--- arrival_date > 7.50 | | | | | | | |--- arrival_date <= 24.50 | | | | | | | | |--- arrival_date <= 23.00 | | | | | | | | | |--- avg_price_per_room <= 92.00 | | | | | | | | | | |--- weights: [14.87, 6.10] class: 0 | | | | | | | | | |--- avg_price_per_room > 92.00 | | | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | | | |--- weights: [13.39, 51.89] class: 1 | | | | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | | | | |--- weights: [3.72, 0.00] class: 0 | | | | | | | | |--- arrival_date > 23.00 | | | | | | | | | |--- weights: [0.00, 36.63] class: 1 | | | | | | | |--- arrival_date > 24.50 | | | | | | | | |--- lead_time <= 150.50 | | | | | | | | | |--- weights: [31.23, 1.53] class: 0 | | | | | | | | |--- lead_time > 150.50 | | | | | | | | | |--- weights: [0.00, 6.10] class: 1 | | |--- market_segment_type_Online > 0.50 | | | |--- lead_time <= 9.50 | | | | |--- lead_time <= 2.50 | | | | | |--- avg_price_per_room <= 202.50 | | | | | | |--- avg_price_per_room <= 60.50 | | | | | | | |--- avg_price_per_room <= 59.50 | | | | | | | | |--- weights: [25.28, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 59.50 | | | | | | | | |--- weights: [0.00, 22.89] class: 1 | | | | | | |--- avg_price_per_room > 60.50 | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | |--- weights: [50.57, 0.00] class: 0 | | | | | | | |--- arrival_month > 1.50 | | | | | | | | |--- arrival_month <= 2.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- weights: [23.05, 4.58] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- arrival_date <= 12.00 | | | | | | | | | | | |--- weights: [4.46, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 12.00 | | | | | | | | | | | |--- weights: [3.72, 13.74] class: 1 | | | | | | | | |--- arrival_month > 2.50 | | | | | | | | | |--- no_of_week_nights <= 8.50 | | | | | | | | | | |--- 
arrival_date <= 22.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | |--- no_of_week_nights > 8.50 | | | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | |--- avg_price_per_room > 202.50 | | | | | | |--- weights: [0.74, 16.79] class: 1 | | | | |--- lead_time > 2.50 | | | | | |--- arrival_month <= 9.50 | | | | | | |--- avg_price_per_room <= 96.50 | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | |--- weights: [30.49, 0.00] class: 0 | | | | | | | |--- arrival_month > 1.50 | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | | |--- weights: [10.41, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | | |--- lead_time <= 5.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 5.50 | | | | | | | | | | | |--- weights: [20.82, 24.42] class: 1 | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | |--- weights: [4.46, 13.74] class: 1 | | | | | | |--- avg_price_per_room > 96.50 | | | | | | | |--- lead_time <= 3.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | |--- avg_price_per_room <= 102.50 | | | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | | | | | |--- avg_price_per_room > 102.50 | | | | | | | | | | |--- avg_price_per_room <= 135.00 | | | | | | | | | | | |--- weights: [14.13, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 135.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- weights: [2.23, 13.74] class: 1 | | | | | | | |--- lead_time > 3.50 | | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | | |--- lead_time <= 7.50 | | | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | | | |--- weights: [15.62, 102.25] class: 1 | | | | | | | | | | 
|--- no_of_adults > 2.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- lead_time > 7.50 | | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | | |--- weights: [0.00, 10.68] class: 1 | | | | | | | | |--- arrival_month > 8.50 | | | | | | | | | |--- avg_price_per_room <= 124.50 | | | | | | | | | | |--- weights: [5.21, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 124.50 | | | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | | | |--- weights: [3.72, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | | | |--- weights: [5.95, 16.79] class: 1 | | | | | |--- arrival_month > 9.50 | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- weights: [29.75, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- lead_time <= 3.50 | | | | | | | | | | | |--- weights: [0.74, 3.05] class: 1 | | | | | | | | | | |--- lead_time > 3.50 | | | | | | | | | | | |--- weights: [19.33, 4.58] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [2.23, 6.10] class: 1 | | | | | | | |--- arrival_month > 11.50 | | | | | | | | |--- weights: [69.90, 0.00] class: 0 | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | |--- weights: [0.00, 7.63] class: 1 | | | |--- lead_time > 9.50 | | | | |--- arrival_year <= 2017.50 | | | | | |--- lead_time <= 60.50 | | | | | | |--- avg_price_per_room <= 212.00 | | | | | | | |--- avg_price_per_room <= 115.50 | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | |--- weights: [124.93, 9.16] class: 0 | | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | | | |--- avg_price_per_room > 
115.50 | | | | | | | | |--- weights: [20.82, 9.16] class: 0 | | | | | | |--- avg_price_per_room > 212.00 | | | | | | | |--- weights: [0.00, 7.63] class: 1 | | | | | |--- lead_time > 60.50 | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | |--- weights: [2.23, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | |--- weights: [26.03, 85.46] class: 1 | | | | | | | |--- arrival_month > 10.50 | | | | | | | | |--- arrival_date <= 19.00 | | | | | | | | | |--- weights: [0.74, 4.58] class: 1 | | | | | | | | |--- arrival_date > 19.00 | | | | | | | | | |--- weights: [7.44, 0.00] class: 0 | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | |--- lead_time <= 85.00 | | | | | | | | |--- weights: [1.49, 0.00] class: 0 | | | | | | | |--- lead_time > 85.00 | | | | | | | | |--- weights: [0.00, 54.94] class: 1 | | | | |--- arrival_year > 2017.50 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- avg_price_per_room <= 104.50 | | | | | | | |--- lead_time <= 24.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | |--- weights: [39.41, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | |--- arrival_date <= 4.50 | | | | | | | | | | | |--- weights: [2.23, 25.94] class: 1 | | | | | | | | | | |--- arrival_date > 4.50 | | | | | | | | | | | |--- weights: [66.18, 128.20] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [48.34, 0.00] class: 0 | | | | | | | |--- lead_time > 24.50 | | | | | | | | |--- avg_price_per_room <= 55.50 | | | | | | | | | |--- arrival_date <= 1.50 | | | | | | | | | | |--- weights: [0.74, 3.05] class: 1 | | | | | | | | | |--- arrival_date > 1.50 | | | | | | | | | | |--- weights: [23.80, 1.53] class: 0 | | | | | | | | |--- avg_price_per_room > 55.50 | | | | | | | | | |--- 
type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 16 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | |--- avg_price_per_room <= 61.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 61.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | |--- avg_price_per_room > 104.50 | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | |--- room_type_reserved_Room_Type 5 <= 0.50 | | | | | | | | | |--- avg_price_per_room <= 194.50 | | | | | | | | | | |--- avg_price_per_room <= 130.50 | | | | | | | | | | | |--- truncated branch of depth 13 | | | | | | | | | | |--- avg_price_per_room > 130.50 | | | | | | | | | | | |--- weights: [143.52, 1045.41] class: 1 | | | | | | | | | |--- avg_price_per_room > 194.50 | | | | | | | | | | |--- weights: [1.49, 125.14] class: 1 | | | | | | | | |--- room_type_reserved_Room_Type 5 > 0.50 | | | | | | | | | |--- arrival_date <= 22.50 | | | | | | | | | | |--- weights: [10.41, 4.58] class: 0 | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | |--- weights: [0.74, 7.63] class: 1 | | | | | | | |--- arrival_month > 10.50 | | | | | | | | |--- lead_time <= 23.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | | |--- weights: [0.00, 7.63] class: 1 | | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | | |--- weights: [2.23, 0.00] class: 0 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- weights: [17.85, 1.53] class: 0 | | | | | | | | |--- lead_time > 23.50 | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- arrival_date > 
7.50 | | | | | | | | | | | |--- weights: [5.21, 61.05] class: 1 | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | | |--- weights: [6.69, 0.00] class: 0 | | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | | |--- weights: [0.74, 6.10] class: 1 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- no_of_week_nights <= 9.00 | | | | | | | |--- avg_price_per_room <= 209.50 | | | | | | | | |--- weights: [43.87, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 209.50 | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | | |--- no_of_week_nights > 9.00 | | | | | | | |--- weights: [0.00, 1.53] class: 1 | |--- no_of_special_requests > 0.50 | | |--- no_of_special_requests <= 1.50 | | | |--- market_segment_type_Online <= 0.50 | | | | |--- lead_time <= 100.50 | | | | | |--- no_of_week_nights <= 11.00 | | | | | | |--- weights: [699.01, 9.16] class: 0 | | | | | |--- no_of_week_nights > 11.00 | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | |--- lead_time > 100.50 | | | | | |--- lead_time <= 105.00 | | | | | | |--- avg_price_per_room <= 74.50 | | | | | | | |--- weights: [2.97, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 74.50 | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | |--- weights: [0.00, 6.10] class: 1 | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | |--- weights: [2.23, 0.00] class: 0 | | | | | |--- lead_time > 105.00 | | | | | | |--- weights: [78.08, 13.74] class: 0 | | | |--- market_segment_type_Online > 0.50 | | | | |--- lead_time <= 6.50 | | | | | |--- lead_time <= 4.50 | | | | | | |--- no_of_week_nights <= 10.00 | | | | | | | |--- weights: [522.77, 30.52] class: 0 | | | | | | |--- no_of_week_nights > 10.00 | | | | | | | |--- weights: [0.74, 3.05] class: 1 | | | | | |--- lead_time > 4.50 | | | | | | |--- arrival_date <= 17.50 | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | 
| | | |--- avg_price_per_room <= 88.00 | | | | | | | | | | |--- weights: [9.67, 1.53] class: 0 | | | | | | | | | |--- avg_price_per_room > 88.00 | | | | | | | | | | |--- arrival_month <= 2.50 | | | | | | | | | | | |--- weights: [0.00, 4.58] class: 1 | | | | | | | | | | |--- arrival_month > 2.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | |--- weights: [16.36, 0.00] class: 0 | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | | | |--- arrival_month > 9.50 | | | | | | | | |--- weights: [18.59, 0.00] class: 0 | | | | | | |--- arrival_date > 17.50 | | | | | | | |--- weights: [69.16, 4.58] class: 0 | | | | |--- lead_time > 6.50 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- avg_price_per_room <= 117.50 | | | | | | | |--- no_of_weekend_nights <= 2.50 | | | | | | | | |--- lead_time <= 61.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | | |--- weights: [73.62, 0.00] class: 0 | | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | | |--- truncated branch of depth 15 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- weights: [136.83, 0.00] class: 0 | | | | | | | | |--- lead_time > 61.50 | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | | |--- truncated branch of depth 13 | | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | |--- no_of_weekend_nights > 2.50 | | | | | | | | |--- lead_time <= 108.50 | | | 
| | | | | | |--- weights: [2.97, 30.52] class: 1 | | | | | | | | |--- lead_time > 108.50 | | | | | | | | | |--- lead_time <= 136.00 | | | | | | | | | | |--- weights: [4.46, 0.00] class: 0 | | | | | | | | | |--- lead_time > 136.00 | | | | | | | | | | |--- weights: [0.00, 3.05] class: 1 | | | | | | |--- avg_price_per_room > 117.50 | | | | | | | |--- no_of_week_nights <= 7.50 | | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | | |--- lead_time <= 141.50 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- lead_time > 141.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- arrival_month > 8.50 | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | | |--- weights: [16.36, 10.68] class: 0 | | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | | |--- weights: [36.44, 0.00] class: 0 | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 14 | | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- no_of_week_nights > 7.50 | | | | | | | | |--- weights: [0.00, 15.26] class: 1 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- no_of_week_nights <= 7.50 | | | | | | | |--- weights: [137.57, 0.00] class: 0 | | | | | | |--- no_of_week_nights > 7.50 | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | |--- no_of_special_requests > 1.50 | | | |--- lead_time <= 89.50 | | | | |--- no_of_week_nights <= 3.50 | | | | | |--- weights: [1593.60, 0.00] class: 0 | | | | |--- no_of_week_nights > 3.50 | | | | 
| |--- no_of_week_nights <= 9.50 | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | |--- lead_time <= 8.50 | | | | | | | | |--- weights: [37.93, 0.00] class: 0 | | | | | | | |--- lead_time > 8.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_date <= 5.50 | | | | | | | | | | |--- weights: [23.80, 1.53] class: 0 | | | | | | | | | |--- arrival_date > 5.50 | | | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [19.33, 0.00] class: 0 | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | |--- weights: [55.77, 0.00] class: 0 | | | | | |--- no_of_week_nights > 9.50 | | | | | | |--- weights: [0.00, 6.10] class: 1 | | | |--- lead_time > 89.50 | | | | |--- no_of_special_requests <= 2.50 | | | | | |--- arrival_month <= 8.50 | | | | | | |--- arrival_year <= 2017.50 | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | |--- weights: [0.74, 15.26] class: 1 | | | | | | | |--- arrival_month > 7.50 | | | | | | | | |--- weights: [8.18, 3.05] class: 0 | | | | | | |--- arrival_year > 2017.50 | | | | | | | |--- avg_price_per_room <= 201.50 | | | | | | | | |--- lead_time <= 150.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- weights: [58.00, 0.00] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- arrival_date <= 24.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- arrival_date > 24.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- lead_time > 150.50 | | | | | | | | | |--- weights: [0.00, 4.58] class: 1 | | | | | | | |--- avg_price_per_room > 201.50 | | | | | | | | |--- weights: [0.00, 9.16] class: 1 | | | | | |--- arrival_month > 8.50 | | | | | | |--- 
lead_time <= 93.50 | | | | | | | |--- lead_time <= 90.50 | | | | | | | | |--- weights: [2.23, 0.00] class: 0 | | | | | | | |--- lead_time > 90.50 | | | | | | | | |--- lead_time <= 91.50 | | | | | | | | | |--- weights: [0.00, 9.16] class: 1 | | | | | | | | |--- lead_time > 91.50 | | | | | | | | | |--- lead_time <= 92.50 | | | | | | | | | | |--- avg_price_per_room <= 80.00 | | | | | | | | | | | |--- weights: [0.00, 3.05] class: 1 | | | | | | | | | | |--- avg_price_per_room > 80.00 | | | | | | | | | | | |--- weights: [4.46, 0.00] class: 0 | | | | | | | | | |--- lead_time > 92.50 | | | | | | | | | | |--- weights: [0.00, 4.58] class: 1 | | | | | | |--- lead_time > 93.50 | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | |--- avg_price_per_room <= 169.00 | | | | | | | | | |--- avg_price_per_room <= 78.50 | | | | | | | | | | |--- weights: [16.36, 3.05] class: 0 | | | | | | | | | |--- avg_price_per_room > 78.50 | | | | | | | | | | |--- avg_price_per_room <= 87.00 | | | | | | | | | | | |--- weights: [5.21, 10.68] class: 1 | | | | | | | | | | |--- avg_price_per_room > 87.00 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | |--- avg_price_per_room > 169.00 | | | | | | | | | |--- weights: [7.44, 0.00] class: 0 | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | |--- lead_time <= 117.50 | | | | | | | | | |--- arrival_date <= 21.00 | | | | | | | | | | |--- weights: [8.18, 1.53] class: 0 | | | | | | | | | |--- arrival_date > 21.00 | | | | | | | | | | |--- weights: [0.74, 6.10] class: 1 | | | | | | | | |--- lead_time > 117.50 | | | | | | | | | |--- arrival_date <= 29.00 | | | | | | | | | | |--- weights: [0.00, 10.68] class: 1 | | | | | | | | | |--- arrival_date > 29.00 | | | | | | | | | | |--- weights: [1.49, 0.00] class: 0 | | | | |--- no_of_special_requests > 2.50 | | | | | |--- weights: [69.16, 0.00] class: 0 |--- lead_time > 151.50 | |--- avg_price_per_room <= 100.50 | | |--- no_of_special_requests <= 0.50 | | | 
|--- no_of_adults <= 1.50 | | | | |--- market_segment_type_Online <= 0.50 | | | | | |--- lead_time <= 163.50 | | | | | | |--- avg_price_per_room <= 85.50 | | | | | | | |--- weights: [5.21, 1.53] class: 0 | | | | | | |--- avg_price_per_room > 85.50 | | | | | | | |--- weights: [0.00, 30.52] class: 1 | | | | | |--- lead_time > 163.50 | | | | | | |--- lead_time <= 341.00 | | | | | | | |--- lead_time <= 173.00 | | | | | | | | |--- lead_time <= 165.00 | | | | | | | | | |--- weights: [52.05, 12.21] class: 0 | | | | | | | | |--- lead_time > 165.00 | | | | | | | | | |--- arrival_month <= 5.00 | | | | | | | | | | |--- weights: [2.23, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 5.00 | | | | | | | | | | |--- weights: [0.00, 16.79] class: 1 | | | | | | | |--- lead_time > 173.00 | | | | | | | | |--- avg_price_per_room <= 98.00 | | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | | |--- arrival_date <= 13.00 | | | | | | | | | | | |--- weights: [1.49, 6.10] class: 1 | | | | | | | | | | |--- arrival_date > 13.00 | | | | | | | | | | | |--- weights: [5.21, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | | |--- weights: [182.93, 4.58] class: 0 | | | | | | | | |--- avg_price_per_room > 98.00 | | | | | | | | | |--- weights: [0.74, 6.10] class: 1 | | | | | | |--- lead_time > 341.00 | | | | | | | |--- avg_price_per_room <= 88.00 | | | | | | | | |--- weights: [0.00, 12.21] class: 1 | | | | | | | |--- avg_price_per_room > 88.00 | | | | | | | | |--- weights: [9.67, 10.68] class: 1 | | | | |--- market_segment_type_Online > 0.50 | | | | | |--- avg_price_per_room <= 2.50 | | | | | | |--- lead_time <= 280.00 | | | | | | | |--- weights: [5.95, 1.53] class: 0 | | | | | | |--- lead_time > 280.00 | | | | | | | |--- weights: [0.00, 3.05] class: 1 | | | | | |--- avg_price_per_room > 2.50 | | | | | | |--- weights: [1.49, 90.04] class: 1 | | | |--- no_of_adults > 1.50 | | | | |--- avg_price_per_room <= 82.50 | | | | | |--- market_segment_type_Online <= 
0.50 | | | | | | |--- lead_time <= 241.50 | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | |--- avg_price_per_room <= 69.00 | | | | | | | | | |--- weights: [16.36, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 69.00 | | | | | | | | | |--- lead_time <= 170.00 | | | | | | | | | | |--- weights: [3.72, 0.00] class: 0 | | | | | | | | | |--- lead_time > 170.00 | | | | | | | | | | |--- arrival_date <= 19.00 | | | | | | | | | | | |--- weights: [1.49, 54.94] class: 1 | | | | | | | | | | |--- arrival_date > 19.00 | | | | | | | | | | | |--- weights: [1.49, 0.00] class: 0 | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | |--- avg_price_per_room <= 66.50 | | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | | |--- weights: [8.92, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | | |--- lead_time <= 191.00 | | | | | | | | | | | |--- weights: [4.46, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 191.00 | | | | | | | | | | | |--- weights: [0.74, 12.21] class: 1 | | | | | | | | |--- avg_price_per_room > 66.50 | | | | | | | | | |--- avg_price_per_room <= 81.50 | | | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | | | |--- weights: [89.98, 10.68] class: 0 | | | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | | | | | |--- avg_price_per_room > 81.50 | | | | | | | | | | |--- weights: [1.49, 3.05] class: 1 | | | | | | |--- lead_time > 241.50 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- weights: [29.00, 0.00] class: 0 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- avg_price_per_room <= 76.00 | | | | | | | | | | |--- weights: [6.69, 259.44] class: 1 | | | | | | | | | |--- avg_price_per_room > 76.00 | | | | | | | | | | |--- lead_time <= 273.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- 
lead_time > 273.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [26.03, 0.00] class: 0 | | | | | |--- market_segment_type_Online > 0.50 | | | | | | |--- weights: [2.97, 312.86] class: 1 | | | | |--- avg_price_per_room > 82.50 | | | | | |--- no_of_adults <= 2.50 | | | | | | |--- lead_time <= 325.50 | | | | | | | |--- market_segment_type_Corporate <= 0.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | | |--- weights: [8.18, 1060.67] class: 1 | | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | | |--- weights: [0.00, 12.21] class: 1 | | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | | |--- weights: [2.97, 0.00] class: 0 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [0.00, 22.89] class: 1 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- weights: [3.72, 0.00] class: 0 | | | | | | | |--- market_segment_type_Corporate > 0.50 | | | | | | | | |--- weights: [0.74, 0.00] class: 0 | | | | | | |--- lead_time > 325.50 | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | |--- weights: [0.00, 10.68] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- weights: [4.46, 0.00] class: 0 | | | | | |--- no_of_adults > 2.50 | | | | | | |--- weights: [4.46, 0.00] class: 0 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- lead_time <= 180.50 | | | | | |--- arrival_date <= 30.50 | | | | | | |--- lead_time <= 156.50 | | | | | | | |--- weights: [4.46, 6.10] class: 1 | | | | | | |--- lead_time > 156.50 | | | | | | | |--- arrival_date <= 1.50 | | | | | | | | |--- weights: [1.49, 3.05] class: 1 | | | | | | | |--- 
arrival_date > 1.50 | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | |--- arrival_date <= 21.50 | | | | | | | | | | |--- weights: [2.23, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 21.50 | | | | | | | | | | |--- weights: [0.00, 3.05] class: 1 | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | |--- weights: [37.93, 0.00] class: 0 | | | | | |--- arrival_date > 30.50 | | | | | | |--- weights: [0.00, 3.05] class: 1 | | | | |--- lead_time > 180.50 | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | |--- no_of_adults <= 2.50 | | | | | | | |--- lead_time <= 356.00 | | | | | | | | |--- weights: [14.87, 0.00] class: 0 | | | | | | | |--- lead_time > 356.00 | | | | | | | | |--- weights: [0.00, 1.53] class: 1 | | | | | | |--- no_of_adults > 2.50 | | | | | | | |--- weights: [0.00, 3.05] class: 1 | | | | | |--- market_segment_type_Online > 0.50 | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | |--- avg_price_per_room <= 33.50 | | | | | | | | |--- weights: [2.23, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 33.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- weights: [0.00, 186.19] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- lead_time <= 276.50 | | | | | | | | | | |--- lead_time <= 221.50 | | | | | | | | | | | |--- weights: [0.74, 4.58] class: 1 | | | | | | | | | | |--- lead_time > 221.50 | | | | | | | | | | | |--- weights: [5.21, 0.00] class: 0 | | | | | | | | | |--- lead_time > 276.50 | | | | | | | | | | |--- weights: [1.49, 13.74] class: 1 | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | |--- weights: [8.18, 0.00] class: 0 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- market_segment_type_Online <= 0.50 | | | | | |--- no_of_special_requests <= 1.50 | | | | | | |--- weights: [107.83, 1.53] class: 0 | | | | | |--- no_of_special_requests > 1.50 | | | | | | |--- arrival_date <= 21.50 | | | | | | | |--- weights: [5.95, 0.00] class: 0 | | | | | | |--- 
arrival_date > 21.50 | | | | | | | |--- lead_time <= 184.00 | | | | | | | | |--- weights: [0.74, 6.10] class: 1 | | | | | | | |--- lead_time > 184.00 | | | | | | | | |--- weights: [2.23, 0.00] class: 0 | | | | |--- market_segment_type_Online > 0.50 | | | | | |--- avg_price_per_room <= 99.50 | | | | | | |--- arrival_date <= 28.50 | | | | | | | |--- lead_time <= 356.50 | | | | | | | | |--- no_of_week_nights <= 5.50 | | | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | | | |--- avg_price_per_room <= 92.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- avg_price_per_room > 92.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- arrival_month > 10.50 | | | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | | | |--- weights: [7.44, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- no_of_week_nights > 5.50 | | | | | | | | | |--- avg_price_per_room <= 65.50 | | | | | | | | | | |--- weights: [3.72, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 65.50 | | | | | | | | | | |--- weights: [7.44, 13.74] class: 1 | | | | | | | |--- lead_time > 356.50 | | | | | | | | |--- weights: [0.00, 4.58] class: 1 | | | | | | |--- arrival_date > 28.50 | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | |--- weights: [10.41, 3.05] class: 0 | | | | | | | |--- arrival_month > 8.50 | | | | | | | | |--- avg_price_per_room <= 79.00 | | | | | | | | | |--- weights: [5.21, 1.53] class: 0 | | | | | | | | |--- avg_price_per_room > 79.00 | | | | | | | | | |--- arrival_date <= 30.50 | | | | | | | | | | |--- weights: [8.18, 29.00] class: 1 | | | | | | | | | |--- arrival_date > 30.50 | | | | | | | | | | |--- weights: [2.97, 0.00] class: 0 | | | | | |--- avg_price_per_room > 99.50 | | | | | | |--- weights: [0.00, 19.84] class: 1 | |--- avg_price_per_room > 100.50 | | |--- arrival_month <= 11.50 | | | |--- 
no_of_special_requests <= 2.50 | | | | |--- weights: [0.00, 3101.13] class: 1 | | | |--- no_of_special_requests > 2.50 | | | | |--- weights: [24.54, 0.00] class: 0 | | |--- arrival_month > 11.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- weights: [40.16, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_date <= 8.00 | | | | | |--- weights: [2.97, 0.00] class: 0 | | | | |--- arrival_date > 8.00 | | | | | |--- weights: [6.69, 27.47] class: 1
# impurity-based importance of each feature in the selected tree
importances = best_model.feature_importances_
# sort ascending so the most important feature is drawn at the top of the chart
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations from tree:
# training performance comparison
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.99437 | 0.99437 | 0.91714 |
| Recall | 0.98570 | 0.98570 | 0.93148 |
| Precision | 0.99708 | 0.99708 | 0.83475 |
| F1 | 0.99136 | 0.99136 | 0.88047 |
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.86355 | 0.83451 | 0.86033 |
| Recall | 0.79641 | 0.77987 | 0.84212 |
| Precision | 0.78911 | 0.73242 | 0.75833 |
| F1 | 0.79274 | 0.75540 | 0.79803 |
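The two comparison tables can be put side by side to show how much each tree's score drops between training and test data. The sketch below is a minimal illustration, not part of the original analysis: it re-enters the reported metrics by hand, and the names `train_scores`, `test_scores`, and `gap` are introduced here only for the example.

```python
import pandas as pd

# Metrics copied from the training and test comparison tables above.
models = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
metrics = ["Accuracy", "Recall", "Precision", "F1"]

train_scores = pd.DataFrame(
    [
        [0.99437, 0.99437, 0.91714],
        [0.98570, 0.98570, 0.93148],
        [0.99708, 0.99708, 0.83475],
        [0.99136, 0.99136, 0.88047],
    ],
    index=metrics,
    columns=models,
)
test_scores = pd.DataFrame(
    [
        [0.86355, 0.83451, 0.86033],
        [0.79641, 0.77987, 0.84212],
        [0.78911, 0.73242, 0.75833],
        [0.79274, 0.75540, 0.79803],
    ],
    index=metrics,
    columns=models,
)

# A positive gap means the model scores higher on training data than on test data.
gap = (train_scores - test_scores).round(5)
print(gap)

# Model with the highest test F1, and model with the smallest train/test F1 gap.
best_test_f1 = test_scores.loc["F1"].idxmax()
smallest_f1_gap = gap.loc["F1"].idxmin()
print(best_test_f1, smallest_f1_gap)
```

On the reported numbers, the post-pruned tree has both the highest test F1 and the smallest F1 gap (about 0.082). In the notebook itself, the same subtraction should work directly on `models_train_comp_df` and `models_test_comp_df`, assuming they share the same index and column labels.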
Observations:
What profitable policies for cancellations and refunds can the hotel adopt?
The coefficients of the logistic regression models and the features used by the decision tree models both provide evidence that INN Hotels should at least consider separate cancellation and refund policies for guests travelling for business and guests travelling for personal reasons.
Additionally, when a hotel is at capacity or overbooked, management could use the model to ensure that all repeat guests and guests travelling for business have rooms available. Management can also combine predictions from both models to identify the bookings that are most likely to be canceled and reallocate those rooms to bookings in the same room category that are least likely to be canceled.
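One possible way to combine the two models' predictions is to average their predicted cancellation probabilities and rank bookings by the result. The sketch below is an assumption about how this could be done, not the method used in this analysis: the booking IDs and probabilities are made up, and the unweighted average is only one of several reasonable combination rules.

```python
import numpy as np

# Hypothetical cancellation probabilities for five bookings of one room category.
booking_ids = np.array(["B1", "B2", "B3", "B4", "B5"])
proba_logit = np.array([0.82, 0.15, 0.55, 0.08, 0.91])  # logistic regression (made up)
proba_tree = np.array([0.90, 0.05, 0.35, 0.10, 0.75])   # decision tree (made up)

# Assumption: a simple unweighted average of the two models' probabilities.
combined = (proba_logit + proba_tree) / 2

# Rank bookings from most likely to least likely to be canceled.
order = np.argsort(-combined)
ranked_ids = booking_ids[order]

most_likely_to_cancel = ranked_ids[0]
least_likely_to_cancel = ranked_ids[-1]
print(ranked_ids.tolist())
```

With real data, `proba_logit` and `proba_tree` would come from each fitted model's predicted probability for the canceled class on the same set of bookings.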
What other recommendations would you suggest to the hotel?
To further improve the utility of the models, the hotel can provide approximate costs for each prediction outcome: true positives, false positives, true negatives, and false negatives. Our team can then tune the models' predictions to achieve the highest expected profit rather than the highest F1 score, which we chose as our evaluation criterion based on the client's use case.
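As a rough sketch of what profit-based tuning could look like, the example below scans classification thresholds and keeps the one with the highest average payoff per booking. Everything in it is hypothetical: the payoff values in `PAYOFF`, the toy labels and probabilities, and the helper names `expected_profit` and `best_threshold` are placeholders that would be replaced by the hotel's own cost estimates and the models' actual predictions.

```python
import numpy as np

# Hypothetical payoff per booking for each outcome (placeholder values only).
PAYOFF = {"tp": 40.0, "fp": -25.0, "fn": -100.0, "tn": 0.0}


def expected_profit(y_true, y_proba, threshold, payoff=PAYOFF):
    """Average payoff per booking when predicting 'canceled' at or above threshold."""
    y_pred = y_proba >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    total = (
        tp * payoff["tp"] + fp * payoff["fp"] + fn * payoff["fn"] + tn * payoff["tn"]
    )
    return total / len(y_true)


def best_threshold(y_true, y_proba, thresholds):
    """Return the threshold with the highest expected profit (first one on ties)."""
    profits = np.array([expected_profit(y_true, y_proba, t) for t in thresholds])
    best = int(np.argmax(profits))
    return thresholds[best], profits[best]


# Toy labels (1 = canceled) and predicted cancellation probabilities.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1])
y_proba = np.array([0.92, 0.22, 0.63, 0.41, 0.12, 0.37, 0.71, 0.83])

thresholds = np.linspace(0.05, 0.95, 19)
threshold, profit = best_threshold(y_true, y_proba, thresholds)
print(f"best threshold: {threshold:.2f}, expected profit per booking: {profit:.2f}")
```

Because the placeholder payoffs penalize a missed cancellation much more than a false alarm, the selected threshold falls well below 0.5. Different cost estimates would move it.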